Prosper’s data set by Yann Dupis

In this exploratory analysis we will explore a dataset from the company Prosper, who is part of the peer-to-peer lending industry. In this analysis we want to answer the following questions:

Univariate analysis

Multivariate Analysis

Characteristics of the dataset

Let’s load the data set:

What is the structure of the dataset?

## [1] 113937     81

This data set contains 113,937 loans with 81 variables, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, Prosper score and the latest payment information.

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Univariate Plots Section

What are the characteristics of the loan (loan amount, term, etc)?

In this section we would like to analyze the characteristics of the loan provided on Prosper’s platform, in terms of loan origination date, loan amount, term, monthly payment, and current loan status.

## [1] 84672    81

There are 84,672 bonds listed since July 2009.

Loan orgination amount

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7500    9094   13750   35000

The minimum loan amount is $1000, the maximum is $35,000 and the average loan amount is $8,337. However the median is $6,500. The difference between the average and median is proably due to a distribution skewed to the right.

When we look at the distribution we observe that there are peaks every $5000 ($5000, $10000, $15,000, $20,000, and $25,000). However it’s interesting to note there is a large number of loans with laon amounts equal to $4,000. One of the reasons why $4,000 loans are popular is because Prosper makes it harder to get loans over than $4,000.

Term

The length of the loan expressed in months.

Prosper provides mostly 36 and 60-month loans.

Monthly Loan Payment

summary(pr$MonthlyLoanPayment)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   157.6   252.2   292.3   389.0  2252.0

Most loans have a mothly payment between $131.6 and $371.6. It’s very surprising that some loans have monthly payments equal to 0. There is probably a data quality issue as it shouldn’t be possible to have monthly payments equal to 0.

## Warning: Removed 11637 rows containing non-finite values (stat_bin).

If we change the binwidth to 1 and limit the x axis to $500 we can observe the most common monthly payment is around $175. This monthly payment probably corresponds to a $4000 loan.

Borrower Rate

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.1359  0.1875  0.1961  0.2574  0.3600
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The Borrower Rate seems quite uniformly distributed.

LoanStatus

## 
##              Cancelled             Chargedoff              Completed 
##                      0                   5326                  19501 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                    997                    205 
##   Past Due (1-15 days)  Past Due (16-30 days)  Past Due (31-60 days) 
##                    806                    265                    363 
##  Past Due (61-90 days) Past Due (91-120 days)   Past Due (>120 days) 
##                    313                    304                     16

Most of the loans are active but some have late payments.

What’s the percentage of people who defaulted?

## [1] 0.01177485

Around 11.7% of loans have defaulted.

What’s the percentage of people who had late payments?

## [1] 0.02441185
## 
##              Cancelled             Chargedoff              Completed 
##                      0                   5326                  19501 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                    997                    205 
##   Past Due (1-15 days)  Past Due (16-30 days)  Past Due (31-60 days) 
##                    806                    265                    363 
##  Past Due (61-90 days) Past Due (91-120 days)   Past Due (>120 days) 
##                    313                    304                     16

Around 2.5% of loan have late payments.

ProsperScore

A custom risk score was created by Prosper in order to assess risk. The score ranges from 1-11, with 11 being the best, or lowest, risk score.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   6.000   5.945   8.000  11.000
## 
##     1     2     3     4     5     6     7     8     9    10    11 
##   989  5766  7639 12594  9807 12271 10581 12008  6843  4718  1456

Most of the loans have a score around 6.

Why people use Prosper?

People use Prosper mostly for Debt Consolidation (1), Other (4), Home Improvement (2), or Business (3).

Characteristics of people who use Prosper.

Income range

What’s the distribution of the IncomeRange of Propsper’s users? Most of the people using Prosper have an income between $25000 and $75000.

Debt Income Ratio

What’s the distribution of the debt to income ratio?

summary(pr$DebtToIncomeRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.150   0.220   0.259   0.320  10.010    7281
## Warning: Removed 7453 rows containing non-finite values (stat_bin).

The debt to income ratio is skewed to the right with a median of 0.22.

Credit Score

What’s the distribtution of the credit score?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   600.0   660.0   700.0   699.4   720.0   880.0

The distribution of the credit score is centered around 700.

What’s the distribtution of Monthly Income?

Monthly Income

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3442    5000    5933    7083 1750000
## Warning: Removed 8163 rows containing non-finite values (stat_bin).

On average people who use Prosper have a monthly income of $5,935.

Univariate Analysis

There are lots of features in the dataset that describe characteristics of the loan, like the monthly payment, loan origination amount, borrower rate, term, etc. It would be interesting to see if the monthly payment amount remains the same for a same loan origination amount and term.

Prosper has a Prosper score that quantifies the loan risk. It would be interesting to see if it has an impact on the montly payment for a same loan origination amount and term.

We can also try to identify if there are characteristics of the borrower that influence the Prosper score.

Finally we can explore how Prosper’s portfolio has evolved.

We have observed that certain loans have a monthly payment equal to 0. There must be a data qualilty issue. For the rest of the analysis, we will exclude these loans.

Bivariate Plots Section

What’s the number of loans issued by loan origination date?

The number of loans issued by loan orgination date have highly increased between 2009 and 2014 even though there was a drop in 2013. Most of the loans issued had a term of 36 months. Prosper issued 60-month loans for the first time in 2011. They stopped issuing 12-month loans in 2013.

Correlation matrix

cor.test(pr$CreditScoreRangeLower,pr$ProsperScore)
## 
##  Pearson's product-moment correlation
## 
## data:  pr$CreditScoreRangeLower and pr$ProsperScore
## t = 115.18, df = 84188, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3631118 0.3747826
## sample estimates:
##       cor 
## 0.3689617

Bivariate Analysis

There are lots of variables correlated in the data set. The most correlated variables (corr = 0.911) are loan origination amount and monthly loan payment, which is not surprising.

Borrower rate is negatively correlated with Prosper score and credit score.

Debt income ratio is slighlty correlated with credit score.

It’s interesting to highlight that credit score and Prosper score are highly correlated (0.386). However, an important variance remains unexplained, which means that Prosper doesn’t only use the credit score to assess risk.

Prosper score is also correlated with loan origination amount and monthly payment.

What’s the average loan amount and number of loans by origin date for each term?

We can explore how the number of loans, loan origination amount, and monthly payment have evolved between 2009 and 2014.

To do so, we can create three metrics: number of loans, average monthly payment, and average loan origination amount by loan origination date and Term. The average loan origination amount has increased since 2009 for loans with terms of 12 and 36 monthly, until Prosper stopped offering 12-month loans in 2013.

The 60-month loans have started with a very high loan origination amount in 2011, dropped in 2012, then increased steadily in 2013 and 2014.

What’s the average loan amount and average monthly payment by origin date for each term?

Monthly payment and loan amount

There are three clusters of points: one is the 12-month loan, another is the 30-month loan, and the last cluster is the 60-month loan.

We can try to explain the variance.

We observe that the variance is explained by risk, which is represented by the ProsperScore. The bottom of the scatter plot is dominated by loans with a ProsoperScore equal to 11, which represents loans with low risks. The top of the scatter plot is dominated by loans with a ProsoperScore equal to 4, which represents loans with higher risks. Loans with a loan amount higher than $25000 are mostly dominated by a ProsperScore equal or superior to 0.

But is ProsperScore good at identifying if the loan is going to default or not?

## 
## FALSE  TRUE 
## 83208   982

Based on the boxplot above, it doesn’t seem that loans that defaulted and loans that didn’t default have a very different Prosper Scores.

Multivariate Analysis

We have observed that the Prosper score has an influence on montly payments for a same loan origination amount and same term. This Score should be a way to quantify the probability that the loan is going to default or not. However when we look at the Prosper Score it doesn’t seem be a great indicator for the investor if he should invest or not.

Final Plots and Summary

Plot One

Description One

The plots above describe how Prosper’s porfolio has involved between 2009 and 2014. The portfolio has drastically changed between 2009 and 2014. In 2009, Prosper issued only 36-month loans. In 2011, Prosper started to issue 12-month and 60-months loan, but stoped issuing 12-month loans in 2013. It seems that Prosper has really changed their underwriting strategies. Between 2011 and 2013, the average loan origination amount has drastically increased, which could expose the lender to a higher risk of loss. It’s perhaps one of the reasons why Prosper stopped issuing 12-month loans.

Most of loans are 36-month loans, and the average loan origination amount has increased drastically between 2009 and 2014.

The 60-month loans started with a very high loan origination amount in 2011, dropped in 2012, then increased steadily in 2013 and 2014.

Plot Two

Description Two

If we look at 36-month loans by monthly loan payment and loan origination amount, we notice that there is a linear relation. However, the monthly payment is influenced by the Prosper score for a same loan orgination amount and same term. We can also identify a couple of outliers where, for example, a same loan orignation amount and montly loan payment yields a Proser score around 10 instead of being around 2 or 3. It means that in order to estimate a monthly payment for a same loan origination amount and term, Prosper uses factors other than just those used for creating of the Prosper Score.

Plot Three

Description Three

The plot above displays a box plot for Prosper Score by the boolean value, defaulted or not. The third quartile for the loans that defaulted is a little bit lower than the loans that didn’t default. However the first quartile and median are equivalent for loans that defaulted and loans that didn’t default. So it seems that the Prosper score is not a good variable to use to determine if an investor should invest or not in a specific loan.

Reflection

We have been able to analyze that Prosper’s portfolio has really changed between 2009 and 2014.

It seems that Prosper doesn’t only use credit scores to assess risk but also several other variables that could be in the dataset, in addition to external data not included in the dataset available.

Instead of just using the Prosper score to decide whether or not to invest, it would be interesting to build a predictive model that quantifies if the person is going to default on their loan. We could use, for example, a logistic regression. However there are lots of variables that are correlated, such as loan origination amount, credit score, and Prosper score. It’s important to carefully choose which variables should be included in the model.